refactor(dialect): extract shared RegexSafety and SqlEmitters helpers #31

richardwooding merged 1 commit into main from refactor/dedupe-dialect-regex-and-emitters on Jun 14, 2026
@richardwooding

Contributor

Summary

Removes duplicated code across the dialect layer that was surfaced by analyzing the repo with the file-search-on tools (find_duplicate_functions, complexity, code_graph). Two clusters were duplicated across the per-dialect classes; both are now consolidated into focused helpers in the dialect package, with each dialect keeping an explicit override that delegates.

1. RegexSafety — shared ReDoS validation (CWE-1333)

All five RE2-style dialects (PostgresRegex, MySqlRegex, DuckDbRegex, BigQueryRegex, SparkRegex) carried a byte-identical copy of:

  • the MAX_PATTERN_LENGTH / MAX_GROUPS / MAX_NESTING_DEPTH limits and the two detector patterns;
  • the three helpers validateNoNestedQuantifiers (the repo's single most complex function, repeated 5×), countUnescapedParens, computeMaxNestingDepth.

Now exposed as RegexSafety.checkLength(...) + RegexSafety.checkReDoS(...).

Bonus fix: the same nested-quantifier rejection previously threw "Invalid pattern in expression" in Postgres/DuckDB/Spark but "Invalid regex pattern" in MySQL/BigQuery. All dialects now emit the former consistently.
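
For readers without the diff open, the shape of the shared validator can be sketched roughly as follows. This is a minimal illustration, not the code in this PR. The limit values, the exception type, the simplified nested-quantifier detector, and the `RegexSafetySketch` / `PostgresRegexSketch` names are all assumptions. Only the `checkLength` / `checkReDoS` entry points, the helper names, and the rejection message come from the description above. The second detector (quantified alternation) is omitted.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a shared ReDoS validator; the real RegexSafety may differ.
final class RegexSafetySketch {
    // Assumed limit values, chosen only for illustration.
    static final int MAX_PATTERN_LENGTH = 500;
    static final int MAX_GROUPS = 20;
    static final int MAX_NESTING_DEPTH = 10;
    // The message all dialects now share, per the PR description.
    static final String REJECTION = "Invalid pattern in expression";

    // Simplified detector: a group that contains + or * and is itself quantified, e.g. (a+)+
    private static final Pattern NESTED_QUANTIFIER =
            Pattern.compile("\\([^()]*[+*][^()]*\\)[+*{]");

    private RegexSafetySketch() {}

    static void checkLength(String pattern) {
        if (pattern.length() > MAX_PATTERN_LENGTH) {
            throw new IllegalArgumentException(REJECTION);
        }
    }

    static void checkReDoS(String pattern) {
        if (NESTED_QUANTIFIER.matcher(pattern).find()
                || countUnescapedParens(pattern) > MAX_GROUPS
                || computeMaxNestingDepth(pattern) > MAX_NESTING_DEPTH) {
            throw new IllegalArgumentException(REJECTION);
        }
    }

    // Counts opening parentheses that are not preceded by a backslash escape.
    static int countUnescapedParens(String pattern) {
        int count = 0;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '(') {
                count++;
            }
        }
        return count;
    }

    // Tracks the deepest level of unescaped group nesting.
    static int computeMaxNestingDepth(String pattern) {
        int depth = 0;
        int max = 0;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '(') {
                max = Math.max(max, ++depth);
            } else if (c == ')' && depth > 0) {
                depth--;
            }
        }
        return max;
    }
}

// Hypothetical dialect class: keeps its own method and delegates the checks.
final class PostgresRegexSketch {
    String validate(String pattern) {
        RegexSafetySketch.checkLength(pattern);
        RegexSafetySketch.checkReDoS(pattern);
        return pattern;
    }
}
```

Because every dialect routes through the same two calls, the rejection message is defined in one place, which is how a divergence like the one fixed here is avoided.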

2. SqlEmitters — shared SQL-fragment emitters

The six *Dialect classes duplicated several emitter bodies. Consolidated into parameterized helpers:

| Helper | Used by | Method(s) |
| --- | --- | --- |
| writeBinaryCall | postgres, duckdb, bigquery, spark | writeSplit |
| writeArrayJoin | postgres, duckdb, bigquery, spark | writeJoin |
| writeJsonEachMembership | duckdb, sqlite | writeJSONArrayMembership, writeNestedJSONArrayMembership |
| writeJsonPathProbe | duckdb, bigquery, sqlite, spark | writeJSONExtractPath |
| writeInfixRegex | postgres, duckdb, mysql, spark | writeRegexMatch |
| writeStandardExtract / writeExtractWithPostgresDow | all 6 / postgres+duckdb | writeExtract |
| writeArrowJsonAccess | postgres, duckdb | writeJSONFieldAccess |

Per-dialect field-name escaping is threaded through as a method reference, so BigQuery's \' escaping is preserved alongside the '' used elsewhere. Dialects whose output genuinely differs (e.g. SQLite's unsupported-op throws, MySQL's JSON_UNQUOTE join) keep their inline implementation.
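
The method-reference threading described above might look something like the sketch below. It is illustrative only. The parameter list of `writeJsonPathProbe`, the `*Sketch` class names, the emitted SQL shape, and the choice of `json_extract` / `JSON_QUERY` as function names are assumptions rather than the repository's actual output. What it takes from the PR is the idea that the shared emitter receives the dialect's escaper, so BigQuery's `\'` and the `''` used elsewhere both survive.

```java
import java.util.function.UnaryOperator;

// Hypothetical shared emitter; the real SqlEmitters signature is not shown in this PR.
final class SqlEmittersSketch {
    private SqlEmittersSketch() {}

    // Writes function(column, '$.field'), escaping the field name with the
    // dialect-supplied escaper instead of hard-coding one quoting rule.
    static void writeJsonPathProbe(StringBuilder out, String function, String column,
                                   String field, UnaryOperator<String> escapeField) {
        out.append(function)
           .append('(')
           .append(column)
           .append(", '$.")
           .append(escapeField.apply(field))
           .append("')");
    }
}

// Hypothetical dialect using doubled single quotes.
final class DuckDbDialectSketch {
    static String escapeField(String name) {
        return name.replace("'", "''");
    }

    // The dialect keeps its explicit method and delegates the body.
    void writeJSONExtractPath(StringBuilder out, String column, String field) {
        SqlEmittersSketch.writeJsonPathProbe(
                out, "json_extract", column, field, DuckDbDialectSketch::escapeField);
    }
}

// Hypothetical dialect using backslash escaping, as the PR notes for BigQuery.
final class BigQueryDialectSketch {
    static String escapeField(String name) {
        return name.replace("'", "\\'");
    }

    void writeJSONExtractPath(StringBuilder out, String column, String field) {
        SqlEmittersSketch.writeJsonPathProbe(
                out, "JSON_QUERY", column, field, BigQueryDialectSketch::escapeField);
    }
}
```

Passing the escaper as a `UnaryOperator<String>` keeps the shared helper free of dialect conditionals; a dialect whose output differs structurally would simply not call the helper.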

Impact

  • ~878 fewer lines across the 11 dialect files, replaced by two focused shared classes.
  • find_duplicate_functions now reports zero duplicate groups among these methods at the standard threshold.

Testing

  • ./gradlew test passes. The conversion tests assert exact SQL output per dialect, so behavior preservation is directly verified.
  • Integration tests (Docker/Testcontainers) were not run locally; CI will exercise them.

🤖 Generated with Claude Code

The five RE2-style dialects each carried a byte-identical copy of the
ReDoS-safety validation (length/group/nesting limits, nested-quantifier
and quantified-alternation detection, and the three helper methods), and
the six dialect implementations duplicated several SQL-fragment emitters
(EXTRACT, array join, regex match, JSON path probe, json_each membership,
binary-function split, arrow JSON access).

Consolidate both into two focused helpers in the dialect package:

- RegexSafety: checkLength + checkReDoS plus the shared limits/patterns.
  Also normalizes the rejection message — MySQL/BigQuery previously
  emitted "Invalid regex pattern" while the others emitted "Invalid
  pattern in expression"; all dialects now use the latter consistently.
- SqlEmitters: writeBinaryCall, writeArrayJoin, writeJsonEachMembership,
  writeJsonPathProbe, writeInfixRegex, writeStandardExtract /
  writeExtractWithPostgresDow, and writeArrowJsonAccess. Per-dialect
  field-name escaping is threaded through as a method reference so
  BigQuery's distinct escaping is preserved.

Each dialect keeps its own override and delegates the body; dialects
whose output genuinely differs keep their inline implementation. Net
~878 fewer lines across the 11 dialect files. No behavioral change —
the per-dialect SQL-output tests pass unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@richardwooding merged commit b05bcc0 into main on Jun 14, 2026 (4 checks passed)
@richardwooding deleted the refactor/dedupe-dialect-regex-and-emitters branch on June 14, 2026 16:29